Final assignment programming 1¶

About the data¶

The data comes from the Lifelines cohort. Lifelines is a large, multi-generational, prospective cohort study that includes over 167,000 participants, roughly 10% of the population of the northern Netherlands. Within this cohort study, the participants are followed over a 30-year period. Every five years, participants visit one of the Lifelines sites in the northern Netherlands for an assessment. During these visits, several physical measurements are taken and different biomaterials are collected. As part of the assessment, participants are asked to fill out comprehensive questionnaires. In between assessments, participants are invited to complete follow-up questionnaires approximately once every 1.5 years.

The data is aggregated on age groups in order to preserve the privacy of participants. The data is offered in two variants: complete and continuous. The complete variant contains all respondents who participated at one or both time points (T1 or T2).

For the research question, the change over time for a single group of participants was not of interest, so the complete data set was used. In this assignment, two parameters are selected and investigated to see whether there is a relation between them.

Research question¶

Is there a relation between lower education level and osteoarthritis in the Lifelines health dataset?

Education:¶

For the public dataset, the percentage of participants whose highest education is ‘lower education’ is shown.
Lower education includes:

  • No education (did not finish primary school)
  • Primary education (primary school, special needs primary school)
  • Lower or preparatory secondary vocational education (such as lts, leao, lhno, vmbo)
  • Junior general secondary education (such as mavo, (m)ulo, mbo-short, vmbo-t)

Osteoarthritis:¶

Osteoarthritis, or joint degradation, is measured as the percentage of participants that suffer from this condition.

More information on the cohort study can be found on the Lifelines wiki and in the Toolkit - Aletta Year Challenge 2022.pdf
https://wiki-lifelines.web.rug.nl/doku.php?id=cohort
https://wiki-lifelines.web.rug.nl/doku.php?id=1a


Libraries:¶

In [1]:
import yaml
import pandas as pd
import numpy as np

#Statistics
import statsmodels.api as sm
from scipy.stats import shapiro

# Visualization:
import seaborn as sns
import hvplot.pandas
import matplotlib.pyplot as plt

# Bokeh
from bokeh.io import output_notebook
from bokeh.plotting import figure, output_file, show
from bokeh.models import (BasicTicker, ColorBar, ColumnDataSource,
                          LinearColorMapper, PrintfTickFormatter, HoverTool)
from bokeh.transform import transform, dodge, factor_cmap, factor_mark
from bokeh.palettes import Viridis256
output_notebook()

# Panel
import panel as pn
pn.extension()
/usr/local/lib/python3.9/dist-packages/pandas/core/computation/expressions.py:20: UserWarning: Pandas requires version '2.7.3' or newer of 'numexpr' (version '2.7.2' currently installed).
  from pandas.core.computation.check import NUMEXPR_INSTALLED
/usr/local/lib/python3.9/dist-packages/pandas/core/arrays/masked.py:59: UserWarning: Pandas requires version '1.3.2' or newer of 'bottleneck' (version '1.2.1' currently installed).
  from pandas.core import (
Loading BokehJS ...

Necessary functions:

In [2]:
from Summary_functions_and_code_clean_v11 import DS_Q_Q_Hist, DS_Q_Q_Plot

Load the data¶

In [3]:
with open("config.yaml", 'r') as stream:
    config = yaml.safe_load(stream)
age = config['lifelines_age']
In [4]:
# Name the dataframes
# Encoding utf_16 is necessary to read the table
age = pd.read_table(age, encoding='utf_16')

Cleaning and inspecting the data¶

In [5]:
# Inspect dataframe data aggregated on age
age.head()
Out[5]:
AGE GENDER GROUP_SIZE_CAT AGE_T1 AGE_T2 BMI_T1 WEIGHT_T1 HIP_T1 WAIST_T1 HEIGHT_T1 ... V_SUM_T1 D_SUM_T1 LTE_SUM_T1 LDI_SUM_T1 LTE_SUM_T2 LDI_SUM_T2 NSES_YEAR NSES MENTAL_DISORDER_T1 MENTAL_DISORDER_T2
0 25 1 2 17,15 20,00 21,53 72,20 89,85 78,03 182,96 ... 17,85 26,57 0,97 2,23 0,57 2,30 2010,00 -0,67 0,33 1,07
1 25 2 4 17,20 19,96 22,27 64,91 91,80 75,34 170,67 ... 20,21 26,35 1,01 3,30 0,66 3,37 2010,00 -0,67 1,40 1,87
2 26 1 3 17,89 21,06 22,15 74,92 91,49 79,89 183,78 ... 17,96 26,16 1,00 2,32 0,51 2,46 2010,00 -0,66 0,41 2,11
3 26 2 5 18,08 21,07 22,57 65,84 92,87 76,39 170,78 ... 20,49 26,63 1,12 3,43 0,81 3,72 2010,00 -0,69 1,60 3,20
4 27 1 3 18,74 21,84 22,57 75,66 92,13 80,82 182,94 ... 18,38 26,37 1,03 2,51 0,79 2,41 2010,00 -0,86 1,11 1,76

5 rows × 72 columns

In [6]:
age.tail()
Out[6]:
AGE GENDER GROUP_SIZE_CAT AGE_T1 AGE_T2 BMI_T1 WEIGHT_T1 HIP_T1 WAIST_T1 HEIGHT_T1 ... V_SUM_T1 D_SUM_T1 LTE_SUM_T1 LDI_SUM_T1 LTE_SUM_T2 LDI_SUM_T2 NSES_YEAR NSES MENTAL_DISORDER_T1 MENTAL_DISORDER_T2
117 83 2 2 73,70 77,37 27,54 73,47 103,96 93,38 163,34 ... 21,00 27,33 1,09 0,72 0,77 0,61 2009,36 -0,87 1,68 1,82
118 84 1 1 74,81 78,12 26,98 84,14 100,81 100,53 176,55 ... 23,00 31,00 0,87 0,64 0,83 0,43 2009,26 -0,79 1,04 0,30
119 84 2 2 74,74 78,32 28,16 75,62 105,17 95,49 163,86 ... 1,19 0,88 0,87 0,59 2009,28 -0,77 1,63 1,00
120 85 1 1 75,85 79,37 26,75 83,09 100,36 100,25 176,34 ... 1,01 0,86 0,91 0,54 2009,50 -0,74 1,07 1,16
121 85 2 1 76,05 79,44 27,42 72,83 104,52 92,67 162,94 ... 1,15 0,83 0,88 0,69 2009,42 -0,79 0,97 2,12

5 rows × 72 columns

In [7]:
# There seems to be missing data
# Check info of dataframe for dtype, missing values, columns
age.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 72 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   AGE                     122 non-null    int64 
 1   GENDER                  122 non-null    int64 
 2   GROUP_SIZE_CAT          122 non-null    int64 
 3   AGE_T1                  122 non-null    object
 4   AGE_T2                  122 non-null    object
 5   BMI_T1                  122 non-null    object
 6   WEIGHT_T1               122 non-null    object
 7   HIP_T1                  122 non-null    object
 8   WAIST_T1                122 non-null    object
 9   HEIGHT_T1               122 non-null    object
 10  BMI_T2                  122 non-null    object
 11  WEIGHT_T2               122 non-null    object
 12  HIP_T2                  122 non-null    object
 13  WAIST_T2                122 non-null    object
 14  HEIGHT_T2               122 non-null    object
 15  EDUCATION_LOWER_T1      122 non-null    object
 16  EDUCATION_LOWER_T2      122 non-null    object
 17  WORK_T1                 122 non-null    object
 18  WORK_T2                 122 non-null    object
 19  LOW_QUALITY_OF_LIFE_T1  122 non-null    object
 20  LOW_QUALITY_OF_LIFE_T2  122 non-null    object
 21  DBP_T1                  122 non-null    object
 22  DBP_T2                  122 non-null    object
 23  HBF_T1                  122 non-null    object
 24  HBF_T2                  122 non-null    object
 25  MAP_T1                  122 non-null    object
 26  MAP_T2                  122 non-null    object
 27  SBP_T1                  122 non-null    object
 28  SBP_T2                  122 non-null    object
 29  HTN_MED_T1              122 non-null    object
 30  CHO_T1                  122 non-null    object
 31  GLU_T1                  122 non-null    object
 32  CHO_T2                  122 non-null    object
 33  GLU_T2                  122 non-null    object
 34  RESPIRATORY_DISEASE_T1  122 non-null    object
 35  PACKYEARS               122 non-null    object
 36  SMOKING                 122 non-null    object
 37  METABOLIC_DISORDER_T1   122 non-null    object
 38  METABOLIC_DISORDER_T2   122 non-null    object
 39  LLDS                    122 non-null    object
 40  Quintile_LLDS           122 non-null    object
 41  ALCOHOL_INTAKE_T1       122 non-null    object
 42  KCAL_INTAKE_T1          122 non-null    object
 43  MWK_VAL                 122 non-null    object
 44  SCOR_VAL                122 non-null    object
 45  MWK_NO_VAL              122 non-null    object
 46  SCOR_NO_VAL             122 non-null    object
 47  SPORTS_T1               122 non-null    object
 48  CYCLE_COMMUTE_T1        122 non-null    object
 49  VOLUNTEER_T1            122 non-null    object
 50  PREGNANCIES             122 non-null    object
 51  OSTEOARTHRITIS_T1       122 non-null    object
 52  SLEEP_QUALITY           122 non-null    object
 53  DIAG_CFS_CDC            122 non-null    object
 54  DIAG_FIBROMYALGIA_ACR   122 non-null    object
 55  DIAG_IBS_ROME3          122 non-null    object
 56  C_SUM_T1                122 non-null    object
 57  A_SUM_T1                122 non-null    object
 58  SC_SUM_T1               122 non-null    object
 59  I_SUM_T1                122 non-null    object
 60  E_SUM_T1                122 non-null    object
 61  SD_SUM_T1               122 non-null    object
 62  V_SUM_T1                122 non-null    object
 63  D_SUM_T1                122 non-null    object
 64  LTE_SUM_T1              122 non-null    object
 65  LDI_SUM_T1              122 non-null    object
 66  LTE_SUM_T2              122 non-null    object
 67  LDI_SUM_T2              122 non-null    object
 68  NSES_YEAR               122 non-null    object
 69  NSES                    122 non-null    object
 70  MENTAL_DISORDER_T1      122 non-null    object
 71  MENTAL_DISORDER_T2      122 non-null    object
dtypes: int64(3), object(69)
memory usage: 68.8+ KB
In [8]:
f"The number of entries is {len(age)}"
Out[8]:
'The number of entries is 122'
  • There are 72 variables (columns) in total in the dataframe.
  • Except for the first three columns, all datatypes are objects. The values are averages per age group, so the dtype should be changed to float.
  • Also, the missing values don't register as missing: whitespace is used to fill the empty cells.
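As an aside, both issues can often be handled at read time instead of with replace calls afterwards. A minimal sketch with toy inline data (the real file path comes from config.yaml, so a StringIO stands in for it here):

```python
import io
import pandas as pd

# Toy tab-separated data mimicking the export: comma decimals and a
# whitespace-filled missing cell
raw = "A\tB\n1,5\t \n2,25\t3,0\n"

# decimal=',' parses the comma decimals directly as floats, and
# na_values turns the whitespace placeholder into NaN at read time
df = pd.read_table(io.StringIO(raw), decimal=',', na_values=' ')
print(df.dtypes.tolist(), df['B'].isnull().sum())
```

Note that na_values matches exact strings, so if the export uses varying amounts of whitespace, the regex-based replace used below is the more robust option.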
In [9]:
# Replace commas with dots for the averages
age = age.replace(',', '.',regex=True)
In [10]:
# Replace the whitespaces with NaN values
age = age.replace(r'\s+', np.nan, regex=True)
In [11]:
# Check dataframe
age.tail()
Out[11]:
AGE GENDER GROUP_SIZE_CAT AGE_T1 AGE_T2 BMI_T1 WEIGHT_T1 HIP_T1 WAIST_T1 HEIGHT_T1 ... V_SUM_T1 D_SUM_T1 LTE_SUM_T1 LDI_SUM_T1 LTE_SUM_T2 LDI_SUM_T2 NSES_YEAR NSES MENTAL_DISORDER_T1 MENTAL_DISORDER_T2
117 83 2 2 73.70 77.37 27.54 73.47 103.96 93.38 163.34 ... 21.00 27.33 1.09 0.72 0.77 0.61 2009.36 -0.87 1.68 1.82
118 84 1 1 74.81 78.12 26.98 84.14 100.81 100.53 176.55 ... 23.00 31.00 0.87 0.64 0.83 0.43 2009.26 -0.79 1.04 0.30
119 84 2 2 74.74 78.32 28.16 75.62 105.17 95.49 163.86 ... NaN NaN 1.19 0.88 0.87 0.59 2009.28 -0.77 1.63 1.00
120 85 1 1 75.85 79.37 26.75 83.09 100.36 100.25 176.34 ... NaN NaN 1.01 0.86 0.91 0.54 2009.50 -0.74 1.07 1.16
121 85 2 1 76.05 79.44 27.42 72.83 104.52 92.67 162.94 ... NaN NaN 1.15 0.83 0.88 0.69 2009.42 -0.79 0.97 2.12

5 rows × 72 columns

In [12]:
# Check the total number of missing values
f"The total number of missing values are {age.isnull().sum().sum()}"
Out[12]:
'The total number of missing values are 116'
In [13]:
age.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 72 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   AGE                     122 non-null    int64 
 1   GENDER                  122 non-null    int64 
 2   GROUP_SIZE_CAT          122 non-null    int64 
 3   AGE_T1                  122 non-null    object
 4   AGE_T2                  122 non-null    object
 5   BMI_T1                  122 non-null    object
 6   WEIGHT_T1               122 non-null    object
 7   HIP_T1                  122 non-null    object
 8   WAIST_T1                122 non-null    object
 9   HEIGHT_T1               122 non-null    object
 10  BMI_T2                  122 non-null    object
 11  WEIGHT_T2               122 non-null    object
 12  HIP_T2                  122 non-null    object
 13  WAIST_T2                122 non-null    object
 14  HEIGHT_T2               122 non-null    object
 15  EDUCATION_LOWER_T1      122 non-null    object
 16  EDUCATION_LOWER_T2      122 non-null    object
 17  WORK_T1                 122 non-null    object
 18  WORK_T2                 122 non-null    object
 19  LOW_QUALITY_OF_LIFE_T1  122 non-null    object
 20  LOW_QUALITY_OF_LIFE_T2  122 non-null    object
 21  DBP_T1                  122 non-null    object
 22  DBP_T2                  122 non-null    object
 23  HBF_T1                  122 non-null    object
 24  HBF_T2                  122 non-null    object
 25  MAP_T1                  122 non-null    object
 26  MAP_T2                  122 non-null    object
 27  SBP_T1                  122 non-null    object
 28  SBP_T2                  122 non-null    object
 29  HTN_MED_T1              122 non-null    object
 30  CHO_T1                  122 non-null    object
 31  GLU_T1                  122 non-null    object
 32  CHO_T2                  122 non-null    object
 33  GLU_T2                  122 non-null    object
 34  RESPIRATORY_DISEASE_T1  122 non-null    object
 35  PACKYEARS               122 non-null    object
 36  SMOKING                 122 non-null    object
 37  METABOLIC_DISORDER_T1   122 non-null    object
 38  METABOLIC_DISORDER_T2   122 non-null    object
 39  LLDS                    122 non-null    object
 40  Quintile_LLDS           122 non-null    object
 41  ALCOHOL_INTAKE_T1       120 non-null    object
 42  KCAL_INTAKE_T1          120 non-null    object
 43  MWK_VAL                 122 non-null    object
 44  SCOR_VAL                122 non-null    object
 45  MWK_NO_VAL              122 non-null    object
 46  SCOR_NO_VAL             122 non-null    object
 47  SPORTS_T1               122 non-null    object
 48  CYCLE_COMMUTE_T1        122 non-null    object
 49  VOLUNTEER_T1            122 non-null    object
 50  PREGNANCIES             61 non-null     object
 51  OSTEOARTHRITIS_T1       122 non-null    object
 52  SLEEP_QUALITY           95 non-null     object
 53  DIAG_CFS_CDC            122 non-null    object
 54  DIAG_FIBROMYALGIA_ACR   122 non-null    object
 55  DIAG_IBS_ROME3          122 non-null    object
 56  C_SUM_T1                119 non-null    object
 57  A_SUM_T1                119 non-null    object
 58  SC_SUM_T1               119 non-null    object
 59  I_SUM_T1                119 non-null    object
 60  E_SUM_T1                119 non-null    object
 61  SD_SUM_T1               119 non-null    object
 62  V_SUM_T1                119 non-null    object
 63  D_SUM_T1                119 non-null    object
 64  LTE_SUM_T1              122 non-null    object
 65  LDI_SUM_T1              122 non-null    object
 66  LTE_SUM_T2              122 non-null    object
 67  LDI_SUM_T2              122 non-null    object
 68  NSES_YEAR               122 non-null    object
 69  NSES                    122 non-null    object
 70  MENTAL_DISORDER_T1      122 non-null    object
 71  MENTAL_DISORDER_T2      122 non-null    object
dtypes: int64(3), object(69)
memory usage: 68.8+ KB

Categories/columns with missing numbers are:¶

Personality traits:

  • C_SUM_T1 - Competence
  • A_SUM_T1 - Anger-Hostility
  • SC_SUM_T1 - Self-consciousness
  • I_SUM_T1 - Impulsivity
  • E_SUM_T1 - Extraversion
  • SD_SUM_T1 - Self-discipline
  • V_SUM_T1 - Vulnerability
  • D_SUM_T1 - Deliberation

(According to the toolkit: only the means of completed cases of the personality questionnaire are in the data set.)

Health and lifestyle:

  • ALCOHOL_INTAKE_T1
  • KCAL_INTAKE_T1
  • PREGNANCIES
  • SLEEP_QUALITY

Only women can get pregnant, so it makes sense that there is data only for the females.
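The columns listed above can also be recovered programmatically instead of scanning age.info() by eye. A small sketch on a toy frame (the real call would simply be age.isnull().sum() on the cleaned dataframe):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cleaned `age` dataframe
demo = pd.DataFrame({
    'PREGNANCIES':   [1.2, np.nan, np.nan, 0.8],
    'SLEEP_QUALITY': [3.1, np.nan, 2.9, 3.4],
    'AGE':           [25, 26, 27, 28],
})

# Per-column missing counts, restricted to columns that have any
missing = demo.isnull().sum()
print(missing[missing > 0])
```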

In [14]:
# Uniques and counts for the Age column
age.AGE.value_counts()
Out[14]:
25    2
56    2
58    2
59    2
60    2
     ..
50    2
51    2
52    2
53    2
85    2
Name: AGE, Length: 61, dtype: int64

Age categories range from 25 to 85 years and are separated into male and female

In [15]:
# Change the dtype of the data frame to floats
age = age.astype(float)
In [16]:
# First three columns should stay integers
age = age.astype({'AGE':'int', 'GENDER':'int','GROUP_SIZE_CAT':'int'})
In [17]:
# Check if the changes worked
age.head()
Out[17]:
AGE GENDER GROUP_SIZE_CAT AGE_T1 AGE_T2 BMI_T1 WEIGHT_T1 HIP_T1 WAIST_T1 HEIGHT_T1 ... V_SUM_T1 D_SUM_T1 LTE_SUM_T1 LDI_SUM_T1 LTE_SUM_T2 LDI_SUM_T2 NSES_YEAR NSES MENTAL_DISORDER_T1 MENTAL_DISORDER_T2
0 25 1 2 17.15 20.00 21.53 72.20 89.85 78.03 182.96 ... 17.85 26.57 0.97 2.23 0.57 2.30 2010.0 -0.67 0.33 1.07
1 25 2 4 17.20 19.96 22.27 64.91 91.80 75.34 170.67 ... 20.21 26.35 1.01 3.30 0.66 3.37 2010.0 -0.67 1.40 1.87
2 26 1 3 17.89 21.06 22.15 74.92 91.49 79.89 183.78 ... 17.96 26.16 1.00 2.32 0.51 2.46 2010.0 -0.66 0.41 2.11
3 26 2 5 18.08 21.07 22.57 65.84 92.87 76.39 170.78 ... 20.49 26.63 1.12 3.43 0.81 3.72 2010.0 -0.69 1.60 3.20
4 27 1 3 18.74 21.84 22.57 75.66 92.13 80.82 182.94 ... 18.38 26.37 1.03 2.51 0.79 2.41 2010.0 -0.86 1.11 1.76

5 rows × 72 columns

In [18]:
# Descriptive stats of data frame
age.describe().T
Out[18]:
count mean std min 25% 50% 75% max
AGE 122.0 55.000000 17.679423 25.00 40.0000 55.000 70.0000 85.00
GENDER 122.0 1.500000 0.502062 1.00 1.0000 1.500 2.0000 2.00
GROUP_SIZE_CAT 122.0 9.811475 4.627142 1.00 6.0000 10.500 15.0000 15.00
AGE_T1 122.0 45.954918 17.519626 17.15 30.8500 45.750 61.0775 76.05
AGE_T2 122.0 49.544016 17.547281 19.96 34.5850 49.460 64.5725 79.44
... ... ... ... ... ... ... ... ...
LDI_SUM_T2 122.0 1.922131 0.928080 0.43 0.8925 2.185 2.5675 3.72
NSES_YEAR 122.0 2009.511148 0.239190 2008.66 2009.4000 2009.495 2009.6375 2010.00
NSES 122.0 -0.676475 0.150640 -1.06 -0.7475 -0.640 -0.5625 -0.42
MENTAL_DISORDER_T1 122.0 1.607541 0.568474 0.33 1.1800 1.515 2.1300 2.60
MENTAL_DISORDER_T2 122.0 1.838361 0.812315 0.19 1.1700 1.865 2.4100 3.77

72 rows × 8 columns

Select the columns that might be relevant for the research question

In [19]:
age_new = age[['AGE','EDUCATION_LOWER_T1', 'OSTEOARTHRITIS_T1','GROUP_SIZE_CAT', 'GENDER']]
In [20]:
f"The total number of missing values are {age_new.isnull().sum().sum()}"
Out[20]:
'The total number of missing values are 0'

The selected columns don't have any missing data, so no interpolation is needed.

In [21]:
# Make a copy of the newly created dataframe
df = age_new.copy()

According to the toolkit of lifelines:
Gender is:

  • Male = 1
  • Female = 2
In [22]:
# Change the values of the gender column to their corresponding name
df['GENDER'] = df['GENDER'].replace([1, 2], ['Male', 'Female'])
In [23]:
df.head()
Out[23]:
AGE EDUCATION_LOWER_T1 OSTEOARTHRITIS_T1 GROUP_SIZE_CAT GENDER
0 25 57.03 0.0 2 Male
1 25 53.23 0.0 4 Female
2 26 46.94 0.0 3 Male
3 26 42.86 0.0 5 Female
4 27 37.11 0.0 3 Male

Correlation heatmap¶

Seaborn¶

Check whether there are correlations among the variables:

In [24]:
df.corr().abs()
/tmp/ipykernel_762813/3426995566.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.corr().abs()
Out[24]:
AGE EDUCATION_LOWER_T1 OSTEOARTHRITIS_T1 GROUP_SIZE_CAT
AGE 1.000000 0.792093 0.857631 0.241149
EDUCATION_LOWER_T1 0.792093 1.000000 0.878771 0.515433
OSTEOARTHRITIS_T1 0.857631 0.878771 1.000000 0.348120
GROUP_SIZE_CAT 0.241149 0.515433 0.348120 1.000000
In [25]:
#correlation matrix
corr = df.corr().abs()
sns.heatmap(corr, annot=True)
/tmp/ipykernel_762813/32526792.py:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  corr = df.corr().abs()
Out[25]:
<AxesSubplot: >

Bokeh¶

In [26]:
c = df.corr().abs()
# Set the x and y range for the heatmap
y_range = (list(reversed(c.columns)))
x_range = (list(c.index))
c
/tmp/ipykernel_762813/4163679710.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  c = df.corr().abs()
Out[26]:
AGE EDUCATION_LOWER_T1 OSTEOARTHRITIS_T1 GROUP_SIZE_CAT
AGE 1.000000 0.792093 0.857631 0.241149
EDUCATION_LOWER_T1 0.792093 1.000000 0.878771 0.515433
OSTEOARTHRITIS_T1 0.857631 0.878771 1.000000 0.348120
GROUP_SIZE_CAT 0.241149 0.515433 0.348120 1.000000
In [27]:
# Reshape
dfc = pd.DataFrame(c.stack(), columns=['r']).reset_index()
dfc.head()
Out[27]:
level_0 level_1 r
0 AGE AGE 1.000000
1 AGE EDUCATION_LOWER_T1 0.792093
2 AGE OSTEOARTHRITIS_T1 0.857631
3 AGE GROUP_SIZE_CAT 0.241149
4 EDUCATION_LOWER_T1 AGE 0.792093
In [28]:
# Heatmap based on the heatmap example from the gitbook
def heatmap():
    #create colormapper 
    mapper = LinearColorMapper(palette=Viridis256, low=dfc.r.min(), high=dfc.r.max())

    #create plot
    p = figure(title="Correlation heatmap life lines data", plot_width=500, plot_height=450,
               x_range=x_range, y_range=y_range, x_axis_location="above", toolbar_location=None)


    source = ColumnDataSource(dfc)

    text_props = dict(source=source, text_align="left", text_baseline="middle")

    #use mapper to fill the rectangles in the plot
    p.rect(x="level_0", y="level_1", width=1, height=1, source=source,
           line_color=None, fill_color=transform('r', mapper))

    # Add the values of r to the rectangles
    # Round the numbers to fit the box
    dfc['rtest'] = round(dfc['r'],3)
    # Text needs to be string
    dfc['rfinal'] = dfc['rtest'].astype(str)

    # Add the text to the boxes (from bokeh documentation: categorical plots)
    text_props = dict(source=dfc, text_align="left", text_baseline="middle")
    x = dodge("level_0", -0.4, range=p.x_range)
    p.text(x=x, y="level_1", text="rfinal", text_font_style="bold", **text_props)

    #create and add colorbar to the right
    color_bar = ColorBar(color_mapper=mapper, location=(0, 0),
                         ticker=BasicTicker(desired_num_ticks=len(x_range)), 
                         formatter=PrintfTickFormatter(format="%.1f"))

    p.add_layout(color_bar, 'right')


    #draw axis
    p.axis.axis_line_color = None
    p.axis.major_tick_line_color = None
    p.axis.major_label_text_font_size = "10px"
    p.axis.major_label_standoff = 0
    p.xaxis.major_label_orientation = 1.0
    show(p)
In [29]:
heatmap()

There seems to be a strong positive relation between lower education and osteoarthritis (0.879).
There also seem to be strong positive relations between lower education and age (0.792), and an even stronger one between osteoarthritis and age (0.858).

Check distribution/normality¶

Q_Q plot¶

A Q-Q plot is used to visually determine how close a sample is to the normal distribution. If the points fall roughly on the diagonal line, the sample can be considered normally distributed.

Education¶

In [30]:
# Q_Q plot of lower education (statsmodels)
fig = sm.qqplot(df.EDUCATION_LOWER_T1, fit = True, line = '45')
plt.show()
In [31]:
# Q_Q plot Lower education (DS1 bayesian statistics)
DS_Q_Q_Plot(df.EDUCATION_LOWER_T1)
Estimation method: robust
n = 122, mu = 32.98, sigma = 20.74
Expected number of data outside CI: 6

With the Bayesian DS1 Q-Q plot you can see that the data falls largely within the 95% confidence interval.
Some points fall outside, but still within the expected number.
The data at the ends bends away from the diagonal line, so the education data might not be normally distributed.

Osteoarthritis¶

In [32]:
# Q_Q plot of Osteoarthritis (statsmodels)
fig = sm.qqplot(df.OSTEOARTHRITIS_T1, fit = True, line = '45')
plt.show()
In [33]:
# Q_Q plot Osteoarthrithis
DS_Q_Q_Plot(df.OSTEOARTHRITIS_T1)
Estimation method: robust
n = 122, mu = 4.95, sigma = 9.85
Expected number of data outside CI: 6

Most of the data falls outside of the 95% CI and the pattern is clearly not linear (more sigmoidal).
The osteoarthritis data does not look normally distributed.

Uniformly distributed data yield a sigmoidal Q-Q plot against the normal distribution. The Q-Q plots here look sigmoidal, so the data might be uniform. Check this with a histogram.
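The sigmoidal shape can be reproduced with simulated uniform data. A minimal sketch using scipy.stats.probplot (rather than the course's DS_Q_Q_Plot helper), checking numerically that the tails of a uniform sample are compressed relative to the normal quantiles:

```python
import numpy as np
from scipy import stats

# Simulated uniform data with the same n as the age groups (n = 122)
rng = np.random.default_rng(42)
u = rng.uniform(0, 100, size=122)

# probplot pairs theoretical normal quantiles with the ordered sample
(theo, samp), _ = stats.probplot(u, dist="norm")

# A uniform has lighter tails than a normal: the standardized sample
# quantiles are less extreme than the theoretical ones at both ends,
# which is exactly what bends the Q-Q plot into an S-shape
z = (samp - u.mean()) / u.std()
print(z[-1] < theo[-1], z[0] > theo[0])
```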

Histograms of data¶

Before assessing the distribution statistically, plot the distribution first.

Hvplot¶

In [34]:
histogram = df.hvplot.hist('EDUCATION_LOWER_T1', ylabel='counts', alpha = 0.4, 
                           title='Histogram lower education percentage').opts(toolbar=None)

histogram
Out[34]:
In [35]:
# Histogram of osteoarthritis
histogram = df.hvplot.hist('OSTEOARTHRITIS_T1',ylabel='counts', alpha = 0.4,
                           title='Histogram osteoarthritis percentage').opts(toolbar=None)
histogram
Out[35]:

DS1 Bayesian¶

In [36]:
print('Education')
DS_Q_Q_Hist(df.EDUCATION_LOWER_T1)
print('Osteoarthritis')
DS_Q_Q_Hist(df.OSTEOARTHRITIS_T1)
Education
Estimation method: robust
mu = 32.98, sigma = 20.74
Osteoarthritis
Estimation method: robust
mu = 4.95, sigma = 9.85

For both types of histograms, lower education and osteoarthritis do not seem to be normally distributed.

Variation histogram: density plot¶

A density plot estimates the underlying distribution and smooths it across the x-axis. It is not affected by the choice of bins, so it is better for judging the shape of a distribution.

In [37]:
# Density plot of education
df.hvplot.kde('EDUCATION_LOWER_T1', alpha=0.5, title='Density plot of education level')
Out[37]:
In [38]:
# Density plot of osteoarthritis
df.hvplot.kde('OSTEOARTHRITIS_T1', alpha=0.5, title='Density plot of Osteoarthritis')
Out[38]:

Statistical analysis¶

Normality check - ShapiroWilk test¶

In [39]:
stat, p = shapiro(df['EDUCATION_LOWER_T1'])
print('Lower education H0: The data is normally distributed')
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Data looks Gaussian (fail to reject H0)')
else:
    print('Data does not look Gaussian (reject H0)')
Lower education H0: The data is normally distributed
Statistics=0.957, p=0.001
Data does not look Gaussian (reject H0)
In [40]:
stat, p = shapiro(df['OSTEOARTHRITIS_T1'])
print('Osteoarthritis H0: The data is normally distributed')
print('Statistics=%.3f, p=%.10f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
Osteoarthritis H0: The data is normally distributed
Statistics=0.835, p=0.0000000002
Sample does not look Gaussian (reject H0)

The data is clearly not normally distributed.

When the data is not normally distributed, the type of analysis changes. Linear models, like the t-test, ANOVA and linear regression, assume that the data is normally distributed. An option is to analyze the data using nonparametric tests, which make no assumptions about the underlying distribution of the data.

https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/parametric-and-non-parametric-data/
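For intuition on why a rank-based test is the right fit here, a small sketch: Spearman's rho uses only the ranks of the observations, so any monotone (even wildly nonlinear) relationship scores as a near-perfect correlation, while Pearson's r does not.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Monotone but strongly nonlinear relationship: ranks are preserved
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = np.exp(x)

rho_s, _ = spearmanr(x, y)   # rank-based: unaffected by the nonlinearity
rho_p, _ = pearsonr(x, y)    # linear: pulled down by the curvature
print(round(rho_s, 3), round(rho_p, 3))
```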

Nonparametric correlation - Spearman correlation¶

In [41]:
from scipy.stats import spearmanr
rho, p = spearmanr(df['EDUCATION_LOWER_T1'],df['OSTEOARTHRITIS_T1'])
print('rho=%.3f, p=%.3f' % (rho,p))
alpha = 0.05
# Type of correlation
if rho > 0:
    print('There is a positive correlation between lower education and osteoarthritis')
else:
    print('There is a negative correlation between lower education and osteoarthritis')

# Significance    
if p > alpha:
    print('The relationship between lower education and osteoarthritis is not statistically significant')
else:
    print('The relationship between lower education and osteoarthritis is statistically significant')
rho=0.807, p=0.000
There is a positive correlation between lower education and osteoarthritis
The relationship between lower education and osteoarthritis is statistically significant

There is a significant positive correlation between lower education and osteoarthritis.

In [42]:
rho, p = spearmanr(df['AGE'],df['OSTEOARTHRITIS_T1'])
print('rho=%.3f, p=%.3f' % (rho,p))
alpha = 0.05
# Type of correlation
if rho > 0:
    print('There is a positive correlation between age and osteoarthritis')
else:
    print('There is a negative correlation between age and osteoarthritis')

# Significance    
if p > alpha:
    print('The relationship between age and osteoarthritis is not statistically significant')
else:
    print('The relationship between age and osteoarthritis is statistically significant')
rho=0.950, p=0.000
There is a positive correlation between age and osteoarthritis
The relationship between age and osteoarthritis is statistically significant
In [43]:
rho, p = spearmanr(df['AGE'],df['EDUCATION_LOWER_T1'])
print('rho=%.3f, p=%.3f' % (rho,p))
alpha = 0.05
# Type of correlation
if rho > 0:
    print('There is a positive correlation between age and lower education')
else:
    print('There is a negative correlation between age and lower education')

# Significance    
if p > alpha:
    print('The relationship between age and lower education is not statistically significant')
else:
    print('The relationship between age and lower education is statistically significant')
rho=0.797, p=0.000
There is a positive correlation between age and lower education
The relationship between age and lower education is statistically significant

The values of the correlations change based on the type of test used:

In [44]:
# Default correlation test (Pearson)
print('Default settings')
df.corr().abs()
Default settings
/tmp/ipykernel_762813/2806251345.py:3: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.corr().abs()
Out[44]:
AGE EDUCATION_LOWER_T1 OSTEOARTHRITIS_T1 GROUP_SIZE_CAT
AGE 1.000000 0.792093 0.857631 0.241149
EDUCATION_LOWER_T1 0.792093 1.000000 0.878771 0.515433
OSTEOARTHRITIS_T1 0.857631 0.878771 1.000000 0.348120
GROUP_SIZE_CAT 0.241149 0.515433 0.348120 1.000000
In [45]:
# Change method to spearman
print('Spearman test')
df.corr(method='spearman').abs()
Spearman test
/tmp/ipykernel_762813/2667412546.py:3: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.corr(method='spearman').abs()
Out[45]:
AGE EDUCATION_LOWER_T1 OSTEOARTHRITIS_T1 GROUP_SIZE_CAT
AGE 1.000000 0.796946 0.950445 0.227519
EDUCATION_LOWER_T1 0.796946 1.000000 0.807291 0.463873
OSTEOARTHRITIS_T1 0.950445 0.807291 1.000000 0.155002
GROUP_SIZE_CAT 0.227519 0.463873 0.155002 1.000000

The heatmap would look different too:

In [46]:
c = df.corr(method='spearman').abs()
y_range = (list(reversed(c.columns)))
x_range = (list(c.index))
dfc = pd.DataFrame(c.stack(), columns=['r']).reset_index()

heatmap()
/tmp/ipykernel_762813/2418570910.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  c = df.corr(method='spearman').abs()

Based on the Spearman test:

  • Although lower, there is still a strong positive relationship between lower education and osteoarthritis (0.807)
  • There is an even higher correlation between age and osteoarthritis (0.95)
  • The correlation between age and lower education stays about the same (0.797)
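Since age correlates strongly with both variables, part of the education–osteoarthritis association could be driven by age. One way to probe this (beyond the scope of this assignment, sketched here on synthetic data with a hypothetical partial_spearman helper) is a rank-based partial correlation:

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic triangle: one confounder z drives both x and y
rng = np.random.default_rng(7)
z = rng.normal(size=500)                   # stand-in for age
x = z + rng.normal(scale=0.5, size=500)    # stand-in for lower education
y = z + rng.normal(scale=0.5, size=500)    # stand-in for osteoarthritis

def partial_spearman(a, b, c):
    """Spearman correlation of a and b after partialling out c,
    via the standard partial-correlation formula on rank correlations."""
    rab, _ = spearmanr(a, b)
    rac, _ = spearmanr(a, c)
    rbc, _ = spearmanr(b, c)
    return (rab - rac * rbc) / np.sqrt((1 - rac**2) * (1 - rbc**2))

raw, _ = spearmanr(x, y)       # inflated by the shared confounder
adj = partial_spearman(x, y, z)  # close to zero once z is controlled
print(round(raw, 2), round(adj, 2))
```

If the education–osteoarthritis correlation in the real data shrank this dramatically after controlling for age, that would suggest age is doing most of the work.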

Visualization¶

In [47]:
# Accessible palettes: bright colors
# https://personal.sron.nl/~pault/
palette = ['#EE6677','#66CCEE']

# Define factors for plots
GENDER = sorted(df.GENDER.unique())
MARKERS = ['circle', 'hex']
In [48]:
def edu_osteo():
    
    # Create Figure
    p = figure(title = "Relationship lower education level and osteoarthritis", toolbar_location=None,
               plot_width = 700, plot_height = 700)

    # Make plot
    p.scatter("EDUCATION_LOWER_T1", "OSTEOARTHRITIS_T1", source=df,
              legend_group="GENDER", fill_alpha=0.4, size=15, line_width=1.5,
              marker=factor_mark('GENDER', MARKERS, GENDER),
              color=factor_cmap('GENDER', palette, GENDER))
    
    #Legend
    p.legend.location = "top_left"
    
    #Axis labels
    p.xaxis.axis_label = 'Percentage lower education level (%)'
    p.yaxis.axis_label = 'Percentage osteoarthritis (%)'
    
    # Hovertool
    hover = HoverTool(tooltips=[('Education','@EDUCATION_LOWER_T1{0,0.00}%'), 
                                ("Osteoarthritis", "@OSTEOARTHRITIS_T1{0,0.00}%")])
    p.add_tools(hover)
    return p
In [49]:
def age_osteo():
    
    # Create Figure
    p = figure(title = "Percentage osteoarthritis by age",toolbar_location=None,
               plot_width = 700, plot_height = 700)
    
    # Make plot
    p.scatter("AGE", "OSTEOARTHRITIS_T1", source=df,
              legend_group="GENDER", fill_alpha=0.4, size=15, line_width=1.5,
              marker=factor_mark('GENDER', MARKERS, GENDER),
              color=factor_cmap('GENDER', palette, GENDER))
    #Legend
    p.legend.location = "top_left"
    
    # Axis
    p.yaxis.axis_label = 'Percentage osteoarthritis (%)'
    p.xaxis.axis_label = 'Age in years'
    
    # Hovertool
    hover = HoverTool(tooltips=[('Age','@AGE'), 
                                ("Osteoarthritis", "@OSTEOARTHRITIS_T1{0,0.00}%")])
    p.add_tools(hover)
    return p
In [50]:
def age_edu():
    
    # Create Figure
    p = figure(title = "Percentage lower education level by age", toolbar_location=None,
               plot_width = 700, plot_height = 700)
       
    # Make plot
    p.scatter("AGE", "EDUCATION_LOWER_T1", source=df,
              legend_group="GENDER", fill_alpha=0.4, size=15, line_width=1.5,
              marker=factor_mark('GENDER', MARKERS, GENDER),
              color=factor_cmap('GENDER', palette, GENDER))
    #Legend
    p.legend.location = "top_left"
    
    #Axis labels
    p.yaxis.axis_label = 'Percentage lower education (%)'
    p.xaxis.axis_label = 'Age in years'
    
    # Hovertool to see the values in the plot
    hover = HoverTool(tooltips=[('Age','@AGE'), 
                                ("Lower education", "@EDUCATION_LOWER_T1{0,0.00}%")])
    p.add_tools(hover)
    return p
In [51]:
# Create tabs to switch between plots
tabs = pn.Tabs()
tabs.append(('Lower education - Osteoarthritis', edu_osteo))
tabs.append(('Age - Osteoarthritis', age_osteo))
tabs.append(('Age - Lower education', age_edu))
In [52]:
tabs
Out[52]:

Discussion¶

Looking at the plots, the relation between lower education and osteoarthritis seems to follow a different pattern for men and women, with higher percentages for women. Women seem to have higher percentages of both lower education and osteoarthritis than men. The trend of osteoarthritis against age for men and women looks similar to the relation between education and osteoarthritis: the older someone is, the higher the percentage of osteoarthritis.
When comparing age to lower education level you can also observe a positive relation: the higher the age, the higher the percentage of lower education.

Women aged 65-85 seem to have higher percentages of osteoarthritis than men in the same age categories. Around 65 years old, women also surpass men in the percentage with a lower education level. Women are more prone to developing osteoarthritis than men, especially after the age of 50 (possibly due to menopause).
https://www.arthritis-health.com/blog/why-are-women-more-prone-osteoarthritis
https://www.niams.nih.gov/health-topics/osteoarthritis

Joint overuse, such as knee bending and repetitive stress on a joint, can damage the joint and increase the risk of osteoarthritis in it. It is possible that people with a lower education more often have jobs involving manual labor, and such occupations might increase the chance of developing osteoarthritis. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4562436/

There seem to be a few outliers when all participants in the dataset are compared.
A few age categories have a high percentage of lower education but low osteoarthritis. When plotting age against education level, you can see that the 25-29 categories have higher percentages of lower education (and do not follow the trend). The low-lying (outlier) data points in the relation between osteoarthritis and education level could very well be these younger age categories, because osteoarthritis becomes more common as people age.

It is interesting to see that the current 25-29 age groups seem to have higher lower-education percentages than the groups aged 30 to 85. There are various possible reasons for this. One explanation could be that young people take some time to decide what to study and pursue higher education later in life. Another could be that younger people with higher educations have moved away and no longer live in these regions, or have simply not participated in this study.
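Since age correlates strongly with both variables, age is a plausible confounder of the education-osteoarthritis relation. One way to probe this (not done in this notebook) is a partial correlation: regress age out of both variables and correlate the residuals. A minimal numpy sketch on synthetic data where age drives both percentages:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in (not the Lifelines data): age drives both
# percentages, plus independent noise in each
age = np.linspace(25, 85, 60)
edu = 0.5 * age + rng.normal(0, 3, age.size)
osteo = 0.4 * age + rng.normal(0, 3, age.size)

def partial_corr(x, y, z):
    """Correlation of x and y after regressing the control z out of both."""
    zz = np.column_stack([np.ones_like(z), z])
    rx = x - zz @ np.linalg.lstsq(zz, x, rcond=None)[0]
    ry = y - zz @ np.linalg.lstsq(zz, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

raw = np.corrcoef(edu, osteo)[0, 1]
ctrl = partial_corr(edu, osteo, age)
print(f'raw r = {raw:.3f}, partial r (age removed) = {ctrl:.3f}')
```

On data like this the raw correlation is high while the partial correlation collapses toward zero, which is the pattern to watch for: if the same happened on the Lifelines aggregates, the education-osteoarthritis relation would largely be an age effect.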


Conclusion¶

In short:

  • The higher the percentage of lower education, the higher the percentage of osteoarthritis.
  • The older the age group, the higher the percentage of lower education.
  • The older the age group, the higher the percentage of osteoarthritis.

The research question was: Is there a relation between lower education level and osteoarthritis?
The results show a strong positive correlation between lower education and osteoarthritis in this dataset, although age correlates with both variables and may account for part of the relation.